Abstract: This paper summarizes the design of a programmable processor with transport triggered architecture (TTA) for decoding LDPC and turbo codes. The processor architecture is designed in such a manner that it can be programmed for LDPC or turbo decoding for the purpose of internetworking and roaming between different networks. The standard trellis based maximum a posteriori (MAP) algorithm is used for turbo decoding. Unlike most other implementations, a supercode based sum-product algorithm is used for the check node message computation for LDPC decoding. This approach ensures the highest hardware utilization of the processor architecture for the two different algorithms. To the best of our knowledge, this is the first attempt to design a TTA processor for the LDPC decoder. The processor is programmed with a high level language to meet the time-to-market requirement. The optimization techniques and the usage of the function units for both algorithms are explained in detail. The processor achieves 22.64 Mbps throughput for turbo decoding with a single iteration and 10.12 Mbps throughput for LDPC decoding with five iterations for a clock frequency of 200 MHz.

I. Introduction 

Forward error correction (FEC) is one of the integral parts of wireless systems. The turbo coding scheme [1] has been adopted for the air interface standard called Long Term Evolution (LTE), which has been defined by the 3rd Generation Partnership Project (3GPP) [2]. Low-density parity-check (LDPC) codes [3] are also gaining popularity, as they have been chosen for IEEE 802.11n WLAN systems [4], IEEE 802.16e WiMAX systems [5] and DVB-S2 [6]. Due to their excellent performance, turbo and LDPC codes are the primary candidates for the FEC scheme of next generation communication systems. Roaming between WLAN and LTE systems requires multimode FEC support. Therefore, a decoder which is able to support both LDPC and turbo codes would be beneficial.

The hardware designs of turbo and LDPC decoders are at a mature stage due to the extensive efforts of researchers. Efficient hardware implementations of turbo decoders can be found in [7] and [8], and efficient hardware implementations of LDPC decoders can be found in [9] and [10]. Sun and Cavallaro [11] even designed multimode decoders as pure hardware designs. Hardware implementations provide high throughput, but their development is not as rapid as that of processor based implementations. Besides, hardware implementations suffer from inflexibility. Therefore, the hardware implementation of a multimode decoder might not be useful for other purposes. The design presented in this paper has the potential to be used as a detector or equalizer running on factor graphs, for example.

Software implementations provide the flexibility required to support a multimode decoder, but require careful design to achieve the target throughput. Programmable accelerators, which enable a software-hardware co-design method, might be an attractive solution to overcome these bottlenecks.

Several application-specific instruction-set processors (ASIPs) for multimode decoders with high throughput have been designed in [12], [13] and [14]. However, all of these ASIPs have been programmed in a low level language, which does not meet the time-to-market requirement. Besides, the utilization of the same function units for both decoding algorithms has not been described explicitly.

Designing software and hardware together to obtain the best performance while ensuring programmability is not a straightforward task. The designer needs an efficient tool that makes it easy to design the processor for a particular application.

In this paper, we propose a design of a processor based on the transport triggered architecture (TTA) for turbo and LDPC decoding. TTA is a very good processor template for a programmable ASIP. The TTA based codesign environment (TCE) tool enables the designer to write an application in a high level language and design the target processor in a graphical user interface at the same time [15].

To the best of our knowledge, this is the first attempt to design a TTA processor for the LDPC decoder. Turbo decoders with TTA have been designed by Salmela et al. [16] and Shahabuddin et al. [17]. As a TTA processor can be best utilized to support different algorithms, a unified processor for turbo and LDPC decoding is the natural research direction.

The max-log-MAP algorithm is used for the component decoders of turbo decoding. The parity-check matrix of size $(M, N)$ for LDPC decoding is decomposed into $M$ rows of two-state trellises or supercodes. The trellis based sum-product algorithm is used on these supercodes for check node calculation. The processor achieves 22.64 Mbps throughput for turbo decoding with a single iteration and 10.12 Mbps throughput for LDPC decoding with five iterations.

The rest of the paper is organized in the following way: In Sections II and III, an overview of the turbo decoding and LDPC decoding algorithms is presented. In Section IV, the simplification techniques and the similarities between the turbo and LDPC decoding algorithms are presented. The common special function unit design is presented in Section V. The processor design is presented in Section VI. In Section VII, the throughput results and a comparison with other implementations are given. The conclusion is given in Section VIII.

II. Review of Turbo Decoding
A. Turbo Decoding 

The turbo decoder consists of two soft-input soft-output (SISO) decoders, with interleavers and de-interleavers between them, as shown in Fig. 1. The inputs of the turbo decoder come from the soft demodulator, which produces the log-likelihood ratios (LLRs) for the systematic bits and the parity bits. The LLRs of the systematic bits, $LuI$, and of the first parity bits, $LcI1$, go to the first SISO decoder. The SISO decoder produces soft outputs based on these LLRs. These soft outputs are used in the second SISO decoder as additional information. The inputs of the second SISO decoder are the LLRs of the systematic bits, the LLRs of the second parity bits, denoted by $LcI2$, and the output of the first SISO decoder. The LLRs of the systematic bits are scrambled this time with the same interleaving pattern used at the encoder. Similarly, the soft outputs coming from the first SISO decoder are scrambled with the same interleaving pattern and are used as a priori values for the second SISO decoder.



Fig. 1. Block diagram of the turbo decoder. 


The heart of turbo coding is the iterative decoding procedure. The second SISO decoder does not produce hard outputs immediately; instead, its soft output is used again in the first SISO decoder for a more accurate approximation. The process continues iteratively in a similar fashion. A single iteration by both the first and the second SISO decoder is referred to as a full iteration. On the other hand, the operation performed by a single SISO decoder can be referred to as a half iteration. At the beginning of the first iteration, the a priori values are set to zero. Six to eight full iterations are used to achieve sufficient performance [1].

B. MAP Algorithm for Component Decoder 

The MAP algorithm for the component decoder applied here was proposed by Benedetto et al. [18]. The algorithm can be stated as follows:

1. Initialize the values of the forward state metric as $\alpha_0(s) = 0$ if $s = S_0$ and $\alpha_0(s) = -\infty$ otherwise.

2. Calculate all the forward state metrics of the same window through the forward recursion according to

$$\alpha_k(s) = \max^*_{e}\big(\alpha_{k-1}[s^S(e)] + u(e)\,LuI[k-1] + c_1(e)\,LcI1[k-1] + c_2(e)\,LcI2[k-1]\big). \tag{1}$$

3. Initialize the values of the backward state metric as $\beta_n(s) = 0$ if $s = S_n$ and $\beta_n(s) = -\infty$ otherwise.

4. Calculate all the backward state metrics of the same window through the backward recursion as

$$\beta_k(s) = \max^*_{e}\big(\beta_{k+1}[s^E(e)] + u(e)\,LuI[k+1] + c_1(e)\,LcI1[k+1] + c_2(e)\,LcI2[k+1]\big). \tag{2}$$

5. The LLR values for the information and both parity bits can be calculated as follows:

$$LLR(\cdot\,; O) = \max^*_{e}\big(\alpha_{k-1}[s^S(e)] + c_1(e)\,LcI1[k-1] + c_2(e)\,LcI2[k-1] + \beta_k[s^E(e)]\big). \tag{3}$$

For the max-log-MAP algorithm, $\max^*(x, y) \approx \max(x, y)$ [19]. The decoding is done in smaller windows so that the decoding process can be done in parallel and the decoder does not have to wait for the whole block to arrive before starting the decoding process. This windowing is sometimes referred to as a sliding window method.
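The difference between the exact $\max^*$ operator and its max-log approximation can be sketched as follows. The definition $\max^*(x, y) = \ln(e^x + e^y)$ is the standard Jacobian logarithm; the function names `max_star`, `max_log` and `correction` are illustrative and are not taken from the processor code, and Python is used only for brevity.

```python
import math


def max_star(x, y):
    """Exact Jacobian logarithm ln(exp(x) + exp(y)) for finite or -inf inputs."""
    if x == -math.inf:
        return y
    if y == -math.inf:
        return x
    # max(x, y) plus a correction term that depends only on |x - y|
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def max_log(x, y):
    """Max-log approximation: the correction term is dropped."""
    return max(x, y)


def correction(x, y):
    """Amount by which the max-log approximation underestimates max*."""
    return max_star(x, y) - max_log(x, y)
```

The correction term never exceeds $\ln 2$ and shrinks as $|x - y|$ grows, so the approximation reduces the operator to a single comparison at the cost of a bounded error per operation.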

III. Review of LDPC Decoding 
A. Quasi-Cyclic LDPC Codes 

LDPC codes are linear block codes which consist of codewords satisfying the parity-check equation

$$Hx^T = 0, \tag{4}$$

where $H$ is the parity-check matrix and $x$ is a codeword. For LDPC codes, the parity-check matrix $H$ is sparse, i.e., it consists of a small number of non-zero entries.

The non-zero entries of the parity check matrix H are 
usually distributed pseudo-randomly according to some distri¬ 
bution. Although this pseudo-random distribution leads to very 
good FER performance, it makes the encoding and decoding 
of LDPC codes difficult. Therefore, a structure is imposed on 
H to ease the encoding and decoding by slightly sacrificing 
the performance. A good trade-off between complexity and 
performance is provided by the quasi-cyclic (QC) LDPC codes 
[20]. 




























The parity check matrix of a QC-LDPC code consists of square blocks which are either all-zero matrices or cyclic shifts of the identity matrix. This structure of the parity check matrix leads to efficient encoding and decoding architectures. Due to their architecture aware construction, QC-LDPC codes have been adopted by several wireless standards such as IEEE 802.11n, IEEE 802.16e and DVB-S2.
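As an illustration of this block structure, the sketch below expands a small base matrix into a binary parity-check matrix. The conventions are assumptions made for the example: an entry of -1 marks an all-zero block, a non-negative entry `s` marks the identity matrix cyclically shifted by `s` columns, and `expand_qc` is a hypothetical helper name. The base matrix used in the test is a toy example, not one taken from the cited standards.

```python
def expand_qc(base, z):
    """Expand a base matrix of shift values into a binary parity-check matrix.

    base: list of rows; each entry is -1 (all-zero z x z block) or a shift s >= 0
    z:    size of each square block
    """
    rows = []
    for base_row in base:
        for r in range(z):                 # r-th row inside the current block row
            row = []
            for shift in base_row:
                block = [0] * z
                if shift >= 0:
                    # shifted identity: the single 1 of row r sits at column (r + s) mod z
                    block[(r + shift) % z] = 1
                row.extend(block)
            rows.append(row)
    return rows
```

Because every non-zero block is a shifted identity, a whole block of the matrix is described by one shift value instead of the positions of its individual ones.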


B. Decoding of LDPC Codes 

LDPC codes can be visualized by a bipartite graph consist¬ 
ing of check and variable nodes which represent the rows and 
columns of the parity check matrix respectively. 

The decoding algorithm of LDPC codes is described as a 
message passing algorithm running on this graph. Alternative 
LDPC decoding algorithms differ basically in two aspects 
which are message computation at the check nodes and mes¬ 
sage flow schedules. In LDPC decoding, messages are almost 
always represented in LLR domain to make the decoding 
numerically stable. 
Fig. 2. Supercodes from the parity-check matrix of LDPC codes.


The sum-product algorithm is employed at the check nodes in exact or approximate fashion. An approximation proposed in [21] computes outgoing messages by finding the minimum of the absolute values of a subset of the incoming messages. Although this approximation is very popular in the LDPC decoding literature, the hardware it requires is not useful for turbo decoding: the units for computing the minimum and the absolute values would be redundant in turbo mode. Therefore, using this approximation reduces the hardware reusability. Hence, we use the forward-backward algorithm running on a binary trellis, similar to [9] and [11]. This algorithm is derived by decomposing the $(M, N)$ parity-check matrix into $M$ binary or two-state trellises, which can also be called supercodes. The supercodes are shown in Fig. 2. The algorithm can be described for a check node of weight $Z$ as follows.

1. The forward and backward recursion metrics are initialized as

$$\alpha_0(0) = \beta_Z(0) = 0, \qquad \alpha_0(1) = \beta_Z(1) = -\infty.$$

2. Forward recursion metrics are computed for $k = 1, 2, \ldots, Z-1$ as

$$\alpha_k(s) = \max^*_{b}\{\alpha_{k-1}(s \oplus b) + (-1)^b Li_k\}, \tag{5}$$

where $s$ and $b$ are binary variables, $\oplus$ denotes the binary addition, and $Li_k$ denotes the incoming message from the $k$-th neighbor.

3. Backward recursion metrics are computed for $k = Z-1, Z-2, \ldots, 1$ as

$$\beta_k(s) = \max^*_{b}\{\beta_{k+1}(s \oplus b) + (-1)^b Li_{k+1}\}. \tag{6}$$

4. Finally, the outgoing message to the $k$-th neighbor is given by

$$Lo_k = \frac{1}{2}\Big(\max^*_{s}\{\alpha_{k-1}(s) + \beta_k(s)\} - \max^*_{s}\{\alpha_{k-1}(s) + \beta_k(s \oplus 1)\}\Big). \tag{7}$$

Notice that steps 3 and 4 can be carried out in the same run, which removes the need to store $\beta_k(\cdot)$.

We prefer the layered schedule [22] as the message flow schedule. This schedule can be formally described as follows.

1. Initialize $\Lambda(n)$ to $\lambda(n)$ for $n = 1, 2, \ldots, N$, where $\lambda(n)$ denotes the LLR of the $n$-th bit received from the channel and $N$ is the block length of the code.

2. For each row $j$ of $H$, repeat the following.

2.a. Assign $Li_k = \Lambda(\pi_j(k)) - Lp_{j,k}$, where $\pi_j(k)$ denotes the location of the $k$-th 1 on the $j$-th row of $H$ and $Lp_{j,k}$ is the outgoing message computed by the $j$-th row in the previous iteration for the $k$-th bit. For the first iteration take $Lp_{j,k}$ as 0.

2.b. Compute the outgoing messages $Lo_k$ according to the algorithm above for the $j$-th row.

2.c. Update $\Lambda(\pi_j(k))$ as $\Lambda(\pi_j(k)) \leftarrow \Lambda(\pi_j(k)) + Lo_k$ and assign $Lp_{j,k} = Lo_k$ to use in the next iteration.

3. Go to Step 2 until a certain number of iterations is reached.

4. $\Lambda(n)$ holds the estimated LLRs from the LDPC decoding.

IV. Shared Calculations between Turbo and LDPC Decoding

There are 16 branch metric computations between two states for the forward metric, backward metric and LLR calculations in the trellis diagram of an eight-state convolutional code.

From the trellis structure of the 3GPP turbo code in Fig. 3, it can be seen that four branch metric calculations are repeated to give the sixteen calculations in total. The four calculations can be expressed as

$$\begin{aligned}
\gamma_1 &= LuI + LcI1 + LcI2,\\
\gamma_2 &= -LuI - LcI1 + LcI2,\\
\gamma_3 &= LuI + LcI1 - LcI2,\\
\gamma_4 &= -LuI - LcI1 - LcI2,
\end{aligned} \tag{8}$$

where $\gamma_4$ can be represented as $-\gamma_1$ and $\gamma_3$ can be represented as $-\gamma_2$. Therefore, it is sufficient to calculate $\gamma_1$ and $\gamma_2$ only.
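The saving can be made concrete with a short sketch that forms only two sums and obtains the other two branch metrics by negation. The function name `branch_metrics` and the argument names are illustrative; the three arguments stand for the systematic LLR and the two parity LLRs of one trellis step.

```python
def branch_metrics(lu, lc1, lc2):
    """Return (gamma1, gamma2, gamma3, gamma4) using only two explicit sums."""
    gamma1 = lu + lc1 + lc2
    gamma2 = -lu - lc1 + lc2
    gamma3 = -gamma2          # equals  lu + lc1 - lc2
    gamma4 = -gamma1          # equals -lu - lc1 - lc2
    return gamma1, gamma2, gamma3, gamma4
```

For example, `branch_metrics(3, 1, 5)` gives `(9, 1, -1, -9)`, which matches the four expressions evaluated directly.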

A. Forward Metric calculation 

The trellis of the 3GPP turbo code can be divided into four butterfly pairs. Using the branch metric values given above, the forward metric calculation of a butterfly pair can be given as

$$\begin{aligned}
\alpha_k(1) &= \max^*\big(\alpha_{k-1}(1) - \gamma_{k-1}(1),\ \alpha_{k-1}(2) + \gamma_{k-1}(1)\big),\\
\alpha_k(5) &= \max^*\big(\alpha_{k-1}(1) + \gamma_{k-1}(1),\ \alpha_{k-1}(2) - \gamma_{k-1}(1)\big).
\end{aligned} \tag{9}$$

Fig. 3. Trellis of 3GPP turbo code.

The forward metric calculation for a supercode in LDPC decoding is similar:

$$\begin{aligned}
\alpha_k(1) &= \max^*\big(\alpha_{k-1}(0) - Li_{k-1},\ \alpha_{k-1}(1) + Li_{k-1}\big),\\
\alpha_k(0) &= \max^*\big(\alpha_{k-1}(0) + Li_{k-1},\ \alpha_{k-1}(1) - Li_{k-1}\big).
\end{aligned} \tag{10}$$

The branch metric in LDPC decoding equals a single LLR value between two time instances. On the other hand, the branch metric in turbo decoding is a combination of three LLR values.
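The shared structure of the two forward recursions can be sketched as one butterfly operation that is called with different operands in each mode. This is a minimal max-log sketch (plain `max` in place of $\max^*$); the names `butterfly`, `turbo_forward_pair` and `ldpc_forward_pair` are illustrative and do not come from the processor code.

```python
def butterfly(metric_a, metric_b, branch):
    """Add-compare-select pair common to both decoders (max-log version).

    Returns (max(a - g, b + g), max(a + g, b - g)) for branch metric g.
    """
    first = max(metric_a - branch, metric_b + branch)
    second = max(metric_a + branch, metric_b - branch)
    return first, second


def turbo_forward_pair(alpha1_prev, alpha2_prev, gamma):
    """Turbo mode: returns (alpha_k(1), alpha_k(5)) of one butterfly pair."""
    return butterfly(alpha1_prev, alpha2_prev, gamma)


def ldpc_forward_pair(alpha0_prev, alpha1_prev, li):
    """LDPC mode: returns (alpha_k(1), alpha_k(0)) of a two-state supercode."""
    return butterfly(alpha0_prev, alpha1_prev, li)
```

Only the meaning of the operands changes between the modes: a sum of three LLRs in turbo mode and a single incoming message in LDPC mode.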

B. Backward Metric and LLR calculation 

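The backward metric calculation is also similar for turbo and LDPC. For turbo decoding, the backward metric calculation of a butterfly pair can be presented as

$$\begin{aligned}
\beta_k(1) &= \max^*\big(\beta_{k+1}(1) - \gamma_{k+1}(1),\ \beta_{k+1}(5) + \gamma_{k+1}(1)\big),\\
\beta_k(2) &= \max^*\big(\beta_{k+1}(1) + \gamma_{k+1}(1),\ \beta_{k+1}(5) - \gamma_{k+1}(1)\big).
\end{aligned} \tag{11}$$

For a supercode in LDPC decoding, the corresponding calculation is

$$\begin{aligned}
\beta_k(1) &= \max^*\big(\beta_{k+1}(0) - Li_{k+1},\ \beta_{k+1}(1) + Li_{k+1}\big),\\
\beta_k(0) &= \max^*\big(\beta_{k+1}(0) + Li_{k+1},\ \beta_{k+1}(1) - Li_{k+1}\big).
\end{aligned} \tag{12}$$

The operations needed to calculate the forward and backward metrics are similar. However, the output LLR computation is different for the two algorithms. The output LLR of turbo decoding involves eight forward metrics, eight backward metrics and all the branch metrics in between. The calculation can be presented as

$$\begin{aligned}
LLR_k = {} & \max\big(\alpha_{k-1}(1) + \beta_k(1) + \gamma_1(k),\ \alpha_{k-1}(2) + \beta_k(5) + \gamma_1(k),\\
& \quad\ \ \alpha_{k-1}(7) + \beta_k(8) + \gamma_1(k),\ \alpha_{k-1}(8) + \beta_k(4) + \gamma_1(k),\\
& \quad\ \ \alpha_{k-1}(3) + \beta_k(6) + \gamma_2(k),\ \alpha_{k-1}(4) + \beta_k(2) + \gamma_2(k),\\
& \quad\ \ \alpha_{k-1}(5) + \beta_k(3) + \gamma_2(k),\ \alpha_{k-1}(6) + \beta_k(7) + \gamma_2(k)\big)\\
{} - {} & \max\big(\alpha_{k-1}(1) + \beta_k(5) - \gamma_1(k),\ \alpha_{k-1}(2) + \beta_k(1) - \gamma_1(k),\\
& \quad\ \ \alpha_{k-1}(5) + \beta_k(7) - \gamma_1(k),\ \alpha_{k-1}(6) + \beta_k(3) - \gamma_1(k),\\
& \quad\ \ \alpha_{k-1}(3) + \beta_k(2) + \gamma_2(k),\ \alpha_{k-1}(4) + \beta_k(6) + \gamma_2(k),\\
& \quad\ \ \alpha_{k-1}(7) + \beta_k(4) + \gamma_2(k),\ \alpha_{k-1}(8) + \beta_k(8) + \gamma_2(k)\big).
\end{aligned} \tag{13}$$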

On the other hand, the LLR calculation of the supercode in LDPC is simple:

$$LLR_k = \frac{1}{2}\Big(\max\big(\alpha_{k-1}(0) + \beta_k(0),\ \alpha_{k-1}(1) + \beta_k(1)\big) - \max\big(\alpha_{k-1}(0) + \beta_k(1),\ \alpha_{k-1}(1) + \beta_k(0)\big)\Big). \tag{14}$$

The output LLR calculation of turbo decoding needs to be divided into four parts to make the calculations similar.
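The complete processing of one supercode, as described in Section III-B, can be sketched as follows. This is a minimal max-log sketch under stated assumptions: plain `max` replaces $\max^*$, the list `li` is indexed from zero (so `li[k - 1]` plays the role of $Li_k$), the output is scaled by 1/2, and the name `check_node` is illustrative. The forward metrics are stored, while the backward metric is kept as a running pair and consumed immediately, following the remark that the backward recursion and the output computation can share one run.

```python
import math

NEG_INF = -math.inf


def check_node(li):
    """Outgoing messages of one check node (two-state trellis, max-log)."""
    z = len(li)

    # Forward pass: alpha[k] = (alpha_k(0), alpha_k(1)) for k = 0 .. z-1
    alpha = [(0.0, NEG_INF)]
    for k in range(1, z):
        a0, a1 = alpha[-1]
        m = li[k - 1]
        alpha.append((max(a0 + m, a1 - m), max(a1 + m, a0 - m)))

    # Backward pass merged with the output computation
    beta0, beta1 = 0.0, NEG_INF           # beta_Z
    lo = [0.0] * z
    for k in range(z, 0, -1):
        a0, a1 = alpha[k - 1]
        bit_zero = max(a0 + beta0, a1 + beta1)   # state unchanged by bit k
        bit_one = max(a0 + beta1, a1 + beta0)    # state flipped by bit k
        lo[k - 1] = 0.5 * (bit_zero - bit_one)
        m = li[k - 1]
        beta0, beta1 = max(beta0 + m, beta1 - m), max(beta1 + m, beta0 - m)
    return lo
```

With the incoming messages `[2.0, -3.0, 1.0, 4.0]` the sketch returns `[-1.0, 1.0, -2.0, -1.0]`. In the max-log case with this scaling, each output equals the product of the signs times the smallest magnitude of the other incoming messages, so the trellis form reproduces the min-sum result while reusing the add-compare-select structure of turbo decoding.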

V. Special Function Unit Design 

A. ALPHA Special Function Unit 

A function unit can be made with three inputs and two outputs to compute the forward and backward metrics. In the turbo case, the unit can use $\alpha_{k-1}(1)$, $\alpha_{k-1}(2)$ and $\gamma_{k-1}(1)$ as inputs and compute $\alpha_k(1)$ and $\alpha_k(5)$ as outputs. In the case of LDPC, the same unit can use $\alpha_{k-1}(0)$, $\alpha_{k-1}(1)$ and the LLR as inputs and compute $\alpha_k(0)$ and $\alpha_k(1)$ as outputs. The same function unit can be used in both cases because the operations are the same, as can be seen from (9) and (10).

One ALPHA special function unit calculates two forward metric values based on the two earlier-state forward metrics in the same butterfly pair. Therefore, four ALPHA units can calculate all the necessary forward metric values for one time instant. A block diagram of the ALPHA unit used for LDPC decoding is presented in Fig. 4.

On the other hand, LDPC decoding can utilize these four units by processing four supercodes in parallel.

B. BetaLLR Special Function Unit 

The backward metric and the LLR are computed together to reduce the memory requirement. For a single algorithm it is easier to design separate units for the backward metric and for the LLR. However, the LLR calculation of LDPC and turbo is not the same. It can be seen from (14) that a special function unit can calculate the two maximizations

$$\begin{aligned}
output1 &= \max\big(\alpha_{k-1}(0) + \beta_k(0),\ \alpha_{k-1}(1) + \beta_k(1)\big),\\
output2 &= \max\big(\alpha_{k-1}(0) + \beta_k(1),\ \alpha_{k-1}(1) + \beta_k(0)\big).
\end{aligned} \tag{15}$$

The unit would calculate the earlier state backward metrics $\beta_k(0)$ and $\beta_k(1)$ based on $\beta_{k+1}(0)$ and $\beta_{k+1}(1)$.

Therefore, the BetaLLR unit takes five inputs and produces four outputs. For example, the unit takes $\beta_{k+1}(0)$, $\beta_{k+1}(1)$, $\alpha_{k-1}(0)$, $\alpha_{k-1}(1)$ and $Li_{k+1}$ as inputs and can produce the earlier state backward metrics $\beta_k(0)$ and $\beta_k(1)$ together with $output1$ and $output2$ of (15). A block diagram of the unit is given in Fig. 5.

Fig. 4. ALPHA unit for a single butterfly pair.
Fig. 5. BetaLLR unit for a single butterfly pair.
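A behavioural sketch of the five-input, four-output unit in LDPC mode is given below. It is a max-log model written for illustration only: the function name `beta_llr_unit` and the argument order are assumptions, and the sketch says nothing about the actual hardware interface or its fixed-point format.

```python
def beta_llr_unit(beta0_next, beta1_next, alpha0_prev, alpha1_prev, li_next):
    """Model of one BetaLLR step for a two-state supercode (max-log).

    Inputs:  beta_{k+1}(0), beta_{k+1}(1), alpha_{k-1}(0), alpha_{k-1}(1), Li_{k+1}
    Outputs: beta_k(0), beta_k(1), output1, output2
    """
    # Backward butterfly: earlier-state backward metrics
    beta0 = max(beta0_next + li_next, beta1_next - li_next)
    beta1 = max(beta0_next - li_next, beta1_next + li_next)
    # The two maximizations that feed the LLR computation
    output1 = max(alpha0_prev + beta0, alpha1_prev + beta1)
    output2 = max(alpha0_prev + beta1, alpha1_prev + beta0)
    return beta0, beta1, output1, output2
```

Because the backward metrics are consumed in the same step in which they are produced, only the forward metrics need to be kept in memory; the LDPC output then follows from the difference of `output1` and `output2`.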


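Equation (13) has to be rearranged into a similar form to utilize the BetaLLR unit to compute the backward metric and the output LLR for turbo decoding. Equation (13) can be expressed as

$$\begin{aligned}
LLR_k = {} & \max\Big(\max\big(\alpha_{k-1}(1) + \beta_k(1),\ \alpha_{k-1}(2) + \beta_k(5)\big) + \gamma_1(k),\\
& \qquad \max\big(\alpha_{k-1}(7) + \beta_k(8),\ \alpha_{k-1}(8) + \beta_k(4)\big) + \gamma_1(k),\\
& \qquad \max\big(\alpha_{k-1}(3) + \beta_k(6),\ \alpha_{k-1}(4) + \beta_k(2)\big) + \gamma_2(k),\\
& \qquad \max\big(\alpha_{k-1}(5) + \beta_k(3),\ \alpha_{k-1}(6) + \beta_k(7)\big) + \gamma_2(k)\Big)\\
{} - {} & \max\Big(\max\big(\alpha_{k-1}(1) + \beta_k(5),\ \alpha_{k-1}(2) + \beta_k(1)\big) - \gamma_1(k),\\
& \qquad \max\big(\alpha_{k-1}(5) + \beta_k(7),\ \alpha_{k-1}(6) + \beta_k(3)\big) - \gamma_1(k),\\
& \qquad \max\big(\alpha_{k-1}(3) + \beta_k(2),\ \alpha_{k-1}(4) + \beta_k(6)\big) + \gamma_2(k),\\
& \qquad \max\big(\alpha_{k-1}(7) + \beta_k(4),\ \alpha_{k-1}(8) + \beta_k(8)\big) + \gamma_2(k)\Big).
\end{aligned} \tag{16}$$

We can divide (16) into four parts to use the BetaLLR unit. For example, one of the parts is given here as

$$\begin{aligned}
output1 &= \max\big(\alpha_{k-1}(1) + \beta_k(1),\ \alpha_{k-1}(2) + \beta_k(5)\big),\\
output2 &= \max\big(\alpha_{k-1}(1) + \beta_k(5),\ \alpha_{k-1}(2) + \beta_k(1)\big).
\end{aligned} \tag{17}$$

The branch metric $\gamma$ needs to be added to or subtracted from the outputs in (17), and a maximization unit then has to be used to get the final LLR output. A block diagram for calculating an LLR in turbo mode is given in Fig. 6.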



Fig. 6. LLR computation with four BetaLLR units in turbo mode. 


Four BetaLLR units can be utilized in LDPC mode by processing four supercodes in parallel.

VI. Transport Triggered Architecture Processor 
A. Top level architecture 

A part of the TTA processor designed for LDPC and turbo decoding is illustrated in Fig. 7. For readability, the whole processor figure is not given. The blocks on the upper part of the figure represent the function units and register files of the processor. The black horizontal straight lines represent the buses of the processor. The vertical rectangular blocks represent the sockets. The connections between function units and buses are illustrated by black dots in the sockets.

The fixed point processor includes load/store unit (LSU), 
arithmetic logic unit (ALU), global control unit (GCU) and 
register files. Based on the resource requirements in high level 
language, function units and register files are added. 
Fig. 7. Implemented processor with reduced number of function units.


The ALU unit is used to perform the basic arithmetic op¬ 
erations like addition, subtraction etc. Operations like shifting 
right or left are also included in ALU. 

For turbo decoding, the forward and backward metric 
computations between two states need at least four of the 
ALPHA and BetaLLR units. Therefore, four ALPHA and four 
BetaLLR units are used in the processor. 

LSUs are used for memory accesses, i.e., to read and write memory. The memory can be read in three clock cycles and written in a single cycle.

Several register files are used to save the intermediate 
results. In terms of the power consumption, registers can 
be more expensive than memory, but to meet the latency 
requirement register files are needed. A single Boolean register 
file has been included in the processor design. 

Thirty buses have been used for the processor. The number 
of buses is crucial to ensure the parallel processing. However, 
the complexity also increases with the increased number of 
buses. 

The LLR outputs are written to a first-in-first-out (FIFO) buffer by a function unit called STREAM. The STREAM unit writes each output sample in three clock cycles.

B. Programming the processor 

The processor is programmed in the high level language C. Several macros are used to call the special function units from the C code, so that the corresponding part of the code is executed on that specific function unit.

The turbo decoding algorithm uses three blocks of 6,144 input LLRs. The blocks are divided into smaller windows to reduce the memory requirements. Only the forward metrics are saved within a window. The backward metric is calculated with the BetaLLR unit and used immediately to calculate the corresponding LLR.

The forward and backward metrics grow in each step, which is why they are normalized to avoid overflow of the stored values.
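The text does not state which normalization is applied, so the sketch below shows one common choice as an assumption: subtracting the largest metric of the current step from all state metrics of that step. The function name `normalize` is illustrative.

```python
def normalize(metrics):
    """Shift all state metrics of one trellis step so that the largest becomes 0."""
    reference = max(metrics)
    return [m - reference for m in metrics]
```

For example, `normalize([10, 7, 12])` gives `[-2, -5, 0]`. Since the same constant is removed from every state, all differences between metrics are preserved, and the decoder outputs, which depend only on such differences, are unchanged.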

Before processing the ALPHA and BetaLLR values, the $\gamma$ values need to be calculated. As shown in Fig. 6, the $\gamma$ values are also added to or subtracted from the outputs of the BetaLLR units, and the results are maximized to find the output LLR.

In the LDPC mode, the processor is programmed for the LDPC code of IEEE 802.11n WLAN with code rate 1/2 and output block size of 648. Due to the data dependencies, a single special function unit is used several times to calculate the required forward and backward metric values of a supercode in serial fashion. For example, the first row of the H matrix of this particular code configuration has seven nonzero elements. The two-state trellis should be a matrix of 8 x 2 to calculate the forward and backward metrics for this row. The initialization values of the metrics are zero. Therefore, only one ALPHA and one BetaLLR unit are used seven times each to get all the necessary output LLRs from this row. Four of these rows can be processed in parallel, as there are four ALPHA and four BetaLLR units available.
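The row-by-row processing combined with the layered schedule of Section III-B can be sketched as follows. Several points are assumptions of this sketch rather than statements about the processor program: rows are processed one after another instead of four at a time, plain `max` replaces $\max^*$, the updated posterior is taken as $Li_k + Lo_k$ (the previous message of the row is removed before the new one is added), and the names `check_node` and `layered_decode` as well as the tiny two-row code in the usage note are hypothetical.

```python
import math


def check_node(li):
    """Max-log outgoing messages of one two-state supercode."""
    z = len(li)
    alpha = [(0.0, -math.inf)]
    for k in range(1, z):                       # forward metrics are stored
        a0, a1 = alpha[-1]
        m = li[k - 1]
        alpha.append((max(a0 + m, a1 - m), max(a1 + m, a0 - m)))
    b0, b1 = 0.0, -math.inf
    lo = [0.0] * z
    for k in range(z, 0, -1):                   # backward metrics used on the fly
        a0, a1 = alpha[k - 1]
        lo[k - 1] = 0.5 * (max(a0 + b0, a1 + b1) - max(a0 + b1, a1 + b0))
        m = li[k - 1]
        b0, b1 = max(b0 + m, b1 - m), max(b1 + m, b0 - m)
    return lo


def layered_decode(rows, channel_llr, iterations):
    """Layered schedule; rows[j] lists the column positions of the ones in row j."""
    posterior = list(channel_llr)                    # Lambda(n), set to lambda(n)
    previous = [[0.0] * len(cols) for cols in rows]  # Lp_{j,k}, zero at the start
    for _ in range(iterations):
        for j, cols in enumerate(rows):
            li = [posterior[c] - previous[j][k] for k, c in enumerate(cols)]
            lo = check_node(li)
            for k, c in enumerate(cols):
                posterior[c] = li[k] + lo[k]         # variable node update
                previous[j][k] = lo[k]               # kept for the next iteration
    return posterior
```

As a usage example, `layered_decode([[0, 1, 2], [2, 3, 4]], [4.0, 3.0, -1.0, 5.0, 2.0], 1)` returns `[3.0, 2.0, 4.0, 7.0, 4.0]`: the third bit, received with the wrong sign, is corrected by the first row and reinforced by the second within a single iteration.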

The variable node update is done by simply adding the LLR 
outputs of the super codes with the corresponding original 
LLRs of the same position. Shifting operations are required to 

VII. Results and Discussion

The designed processor takes 166,224 clock cycles to process three blocks of 6,144 samples for a full iteration of turbo decoding. The processor takes 10,368 clock cycles to decode an LDPC code of IEEE 802.11n WLAN with block size 648 and code rate 1/2 after five iterations.











The throughput can be calculated as

$$\text{Throughput} = \frac{\text{size of the code block} \times \text{device clock frequency}}{\text{latency} \times \text{number of iterations}}. \tag{18}$$

The throughput achieved for the turbo mode is 22.64 Mbps 
for a single iteration and for LDPC mode 10.12 Mbps for five 
iterations for a clock frequency of 200 MHz. 
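The throughput expression (18) can be written as a small helper. The function name `throughput_mbps` and its parameter names are illustrative, the latency is taken as the number of clock cycles per iteration, and the numbers in the usage note are placeholders chosen for easy arithmetic, not the measured cycle counts of the processor.

```python
def throughput_mbps(block_size_bits, clock_hz, latency_cycles, iterations):
    """Throughput in Mbps: block size times clock frequency over total cycles."""
    bits_per_second = block_size_bits * clock_hz / (latency_cycles * iterations)
    return bits_per_second / 1e6
```

For instance, a hypothetical 1,000-bit block decoded in 5,000 cycles per iteration at 100 MHz gives `throughput_mbps(1000, 100_000_000, 5000, 1)`, i.e., 20 Mbps, and doubling the number of iterations halves the result.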

The buses of the processor are perfectly utilized to achieve the best possible result due to the perfect scheduling. The numbers of some of the operations during the algorithm execution are summarized in Table I.

TABLE I
Number of operations

Operation   # of OPS in turbo   # of OPS in LDPC
ADD         431,009             87,134
SUB         96,354              14,231
MAX         43,008              0
ALPHA       24,576              2,376
BetaLLR     24,576              2,376
STREAM      6,144               648


The number of addition operations does not only represent the additions of the algorithm itself, but also additions for other purposes such as loop indexing. The maximization units are not used in LDPC decoding because the maximization operations are done inside the ALPHA and BetaLLR units.

A comparison with other programmable implementations of turbo decoders is presented in Table II. The throughput results are normalized to a clock frequency of 200 MHz. Our proposed processor in turbo mode provides higher throughput than the DSP [23] and VLIW ASIP [24] implementations. The TTA processors of [16] and [17] provide higher throughputs, but those designs were dedicated to turbo decoding only.


TABLE II
Programmable Turbo Processors

Reference   Architecture              Algorithm     Throughput
[23]        TMS320C6201 DSP           max-log-MAP   2 Mbps
[24]        VLIW ASIP                 max-log-MAP   5 Mbps
proposed    TTA proc. in turbo mode   max-log-MAP   22.64 Mbps
[17]        TTA proc. for LTE         max-log-MAP   31.21 Mbps
[25]        Nvidia C1060              max-log-MAP   33.85 Mbps
[16]        TTA proc.                 max-log-MAP   98 Mbps


A comparison with other programmable implementations of LDPC decoders is presented in Table III. The throughput results are normalized to a clock frequency of 200 MHz. Our proposed processor in LDPC mode provides moderate throughput compared to most of the programmable implementations.

TABLE III
Programmable LDPC processors

Reference   Architecture         Algorithm                     Throughput
[26]        TMS320C64XX DSP      min-sum                       1.8 Mbps @ 10 it.
proposed    TTA proc. for LDPC   supercode based sum-product   10.12 Mbps @ 5 it.
[27]        SDR SODA             min-sum                       15.2 Mbps @ 10 it.
[14]        VLIW ASIP            offset min-sum                53 Mbps @ 10 it.
[12]        VLIW ASIP            offset min-sum                16.32 - 128.5 Mbps @ 10-20 it.

Alles et al. presented an efficient implementation of a multimode decoder in [12]. The ASIP achieved 34.5 Mbps to 257 Mbps for LDPC codes of different code configurations and block sizes of IEEE 802.11n at a clock frequency of 400 MHz. The lowest throughput of 34.5 Mbps at a 400 MHz clock frequency after 20 iterations was achieved when the code rate and block size are the same as presented in this paper. The design of [14] achieved high throughput with a different code configuration. Besides, all the implementations presented here used assembly language.

VIII. Conclusion 

The paper discussed the design issues of a turbo and LDPC decoder on a TTA processor. The design shows the promise of implementing several decoding techniques on a single TTA processor. The processor designed in this paper can be used for tasks beyond decoding; for instance, it can be programmed for detection and equalization algorithms running on factor graphs [28]. The target throughput of LTE can also be reached by a multi-core TTA processor. The flexibility gained from such a processor could provide very interesting results and would be a fruitful direction for future research.